Connectivity and population structure gafftopsail catfish

Supplementary Data: Methods & Results

SJ O’Leary

2022-07-11

Data set used for analysis:

## /// GENIND OBJECT /////////
## 
##  // 367 individuals; 5,554 loci; 18,606 alleles; size: 30.7 Mb
## 
##  // Basic content
##    @tab:  367 x 18606 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 2-25)
##    @loc.fac: locus factor for the 18606 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: read.genepop(file = "data/POPGEN/BMA_by_pop_genepop.gen", ncode = 3L, 
##     quiet = FALSE)
## 
##  // Optional content
##    @pop: population of each individual (group size range: 20-52)
##    @strata: a data frame with 10 columns ( LIB_ID, PLATE, WELL, SAMPLE_ID, LAT, LONG, ... )

1 Sample locations/Regions

Samples were collected in or near estuaries throughout the Gulf of Mexico and southwest Atlantic, individual geographic coordinates were not available for all samples. Individuals are grouped by putative populations based on sample location. Estuaries are grouped by geographic regions (Northern, Central, Western, Southern Gulf, and Southwest Atlantic). Geographic regions are categorized as being located in the Gulf and Atlantic ocean basins, though it should be noted that the Atlantic consists of a single sample.

## OGR data source with driver: ESRI Shapefile 
## Source: "/home/soleary/GAFFTOPS/BMA_POPGEN/data/BASEMAPS", layer: "ne_10m_land"
## with 1 features
## It has 2 fields
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/soleary/GAFFTOPS/BMA_POPGEN/data/BASEMAPS", layer: "ne_10m_admin_1_states_provinces_lines"
## with 10118 features
## It has 21 fields

Sample distribution of gafftopsail catfish throughout the Gulf of Mexico and Atlantic Ocean

Sample distribution of gafftopsail catfish throughout the Gulf of Mexico and Atlantic Ocean

Sample distribution by putative population, geographic region, and ocean basin.

Number of sample per putative population, Florida Atlantic (FLA), South and North Florida (FLGS, FLGN), Alabama (MB), Mississippi (MISS), Louisiana (CS, LA), Texas (CC), and Campeche (SG).

POP n
FLA 41
FLGS 51
FLGN 41
MB 52
MISS 37
CS 20
LA 48
CC 50
CAMP 27

Number of samples per geographic region, Southwest Atlantic (Florida), Eastern Gulf (North, South Florida), Central Gulf (Alabama, Mississippi, Louisiana), Western Gulf (Texas), Southern Gulf (Campeche).

REGION n
SWATL 41
EGULF 92
CGULF 157
WGULF 50
SGULF 27

Number of samples per ocean basin.

OCEAN n
ATL 41
GULF 326

2 Compare overall patterns of variance within data set

2.1 Test for heterogeneity using AMOVA

Homogeneity in allele and genotype distributions among samples tested using a single-level, locus-by-locus analysis of molecular variance (AMOVA) as implemented in Arlequin.

2.2 PCA

Perform PCA; missing values in allele count matrix replaced with mean values (across all individuals).

Percent variance explained by first 25 principle components

Percent variance explained by first 25 principle components

Top three principle components retained, which explain a cumulative 2.53% variance.

Distribution of individuals on first three principle components

Distribution of individuals on first three principle components

3 Population differentiation-based (FST) outlier detections

3.1 Model-based estimation of null distribution: Bayesian maximum likelihood (multinomial-Dirichlet model)

Run BayeScan for individuals grouped by sample location (putative populations)

BayeScan (Foll and Gaggiotti 2008Foll, Matthieu, and Oscar Gaggiotti. 2008. “A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: A Bayesian perspective.” Genetics 180 (2): 977–93. https://doi.org/10.1534/genetics.108.092221.; Foll et al. 2010Foll, Matthieu, Martin C. Fischer, Gerald Heckel, and Laurent Excoffier. 2010. “Estimating population structure from AFLP amplification intensity.” Molecular Ecology 19 (21): 4638–47. https://doi.org/10.1111/j.1365-294X.2010.04820.x.) uses differences in allele frequencies between populations to identify Fst-outlier loci. The null model (distribution of Fst for neutral loci) is generated based on the multinomial-Dirichlet model (island model) which assumes allele frequencies of subpopulations are correlated through a common migrant gene pool from which they differ to varying degrees; this difference is measured by a subpopulation-specific Fst coefficient. This model can incorporate ecologically more realistic scenarios compared to a strict island model and is a robust method even when there are differing effective sizes and migration rates per subpopulation.

Selection is introduced by decomposing population-specific Fst coefficients into population-specific component (beta), shared by all loci, and locus-specific component (alpha), shared by all poulations using logistic regresstion. If locus-specific component is necessary to explain observed pattern of diversity (i.e. alpha significantly different from 0), departure from neutrality is inferred.

BayeScan calculates a posterior probability for model including selection using bayes factors. Multiple testing is needed to incorporate identifying loci as under selection by chance. The posterior odds (instead of bayes factors) are calculated to make the decision on whether loci should be considered outlier. Posterior odds are calculated as the ratio of posterior probabilities indicating how much more likely the model with selection is compared to netural model. Posterior probabilities can be used to directly control FDR (expected proportion of false positives amont outlier loci). q-value of each locus indicates the minimum FDR at which this locus may become significant, e.g. q-value > 0.05 means that 5% of corresponding outlier markers are expected to be false positives (5% threshold more stringent than corresponding p-value in classic statistics).

Use PGDSpider to convert genepop file to bayescan format.


# input file with individuals grouped by sample location
java -jar /usr/local/bin/PGDSpider2-cli.jar -inputfile data/POPGEN/BMA_by_pop_genepop.gen -inputformat GENEPOP -outputfile data/POPGEN/BMA.genotypes_bayescan.txt -outputformat GESTE_BAYE_SCAN -spid scr/genepop2bayscan.spid

Get list of loci in data set to be able to identify outlier loci by name.

Prior odds for neutral model are determined using -pr_odds and indicate how much more likely the neutral model (does not include locus-specific effect alpha) is compared to the model of select (includes locus-specific effect alpha). Prior odds = 10 indicates that the neutral model is 10x more likely. The test of selection becomes more conservative/less false positive outlier are detected with increasing prior odds. For large amount of data (loci) including many populations and individuals per population, the prior odds should have little influence on on results but for realistic data sets (< 20 populations), the choice of prior odds can have strong effect and should be adjusted based on the number of loci included in the data set. For < 1,000 loci, prior odds = 10 is reasonable for larger data sets (1,000 = 10,000), prior odds = 100 is more appropriate, for millions of markers prior odds = 10,000 may be necessary.

Short, successive pilot runs (-nbp) are used to to adjust acceptance rates between 0.25 - 0.45 (-pilot determines iterations) and choose proposal distribution for reversible jump by estimating mean and variance for all alpha’s under the saturated model (contains all alpha parematers), which is close to full conditional distribution and generally creates conservative enough parameters to ensure convergence.

Sample size (-n) corresponds to number of iterations the program will write out/use for estimation of parameters after initial burn-in (-burn).

Thinning interval (-thin) is number of intervals between two samples; reduces autocorrelation from data generated using Markov chain

Total number of interations is (sample size x thinning interval) + burn-in.


# run for individuals grouped by sample location
BayeScan2.1_linux64bits data/POPGEN/BMA.genotypes_bayescan.txt -od results/ -o BMA_pop.bayescanPO10kburn100kn10kthin50 -all_trace -threads 45 -n 10000 -thin 50 -nbp 25 -pilot 5000 -burn 50000 -pr_odds 10000 -out_pilot -out_freq

Parameters used:

##  [1] "There are 5554 loci."                        
##  [2] ""                                            
##  [3] "There are 9 populations."                    
##  [4] ""                                            
##  [5] "Burn in: 50000"                              
##  [6] "Thining interval: 50"                        
##  [7] "Sample size: 10000"                          
##  [8] "Resulting total number of iterations: 550000"
##  [9] "Nb of pilot runs: 25"                        
## [10] "Length of each pilot run: 5000"              
## [11] ""

Plot posterior distributions:

Full output of MCMC algorithm is in *.sel file. Each line corresponds to iteration of MCMC algorithm, columns contain iteration index (start after pilot runs/burn-in), logarithm of likelihood, Fst coefficient for every population, alpha coefficients for every locus.

Counting the null values of alpha gives the posterior probability for the neutral model (only written out if -all_trace flag enabled).

Trace of likelihood and values of Fst over iterations.

Trace of likelihood and values of Fst over iterations.

Verify that sample size used to estimate posteriors is sufficiently large. Effective sample size to estimate parameters can be smaller than value used for BayeScan run (10,000). MCMC explores the parameter space by moving in small steps. Therefore, two consecutive values will be strongly correlated; used thinning interval of 50 to reduce autocorrelation.

Check correlation between sampled parameter values for thinned chains used to estimate posterior probability. Effective sample size will be smaller than sample size used as input (10,000) if there is some correlation.

##  'mcmc' num [1:9999, 1:5565] 50050 50100 50150 50200 50250 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:5565] "iteration" "logL" "fst1" "fst2" ...
##  - attr(*, "mcpar")= num [1:3] 1 499901 50

Comparison of effective sample sizes for fst, alpha, and likelihood

Comparison of effective sample sizes for fst, alpha, and likelihood

Effective size of likelihood sample should be smaller than input value of 10,000, while Fst parameters less affected by correlation because correlation decreases more rapidly for Fst values than for likelihood.

Test for convergence

Test for non-convergence of chains using Geweke’s convergence diagnostic which compares the means of the first and last parts of the MC and reports the z-scores for each parameter.

For α = 0.05, critical values of z are – 1.96 and +1.96, i.e. if z values fall within those boundaries indicative of equality of means and therefore convergence of MCMC. On the otherhand z < -1.96 or z > 1.96 null hypothesis of equality of means should be rejected.

Comparison of z values for fst, alpha, and log likelihood.

Comparison of z values for fst, alpha, and log likelihood.

Compare Fst, alpha, and q-values

File *_fst.txt contains one locus per row (first column). Columns 2 - 4 correspond to posterior probability for model including selection (prob), log10 of posterior odds for model including slection (log10PO), q-value for model including selection (qval); these are related to the test of local adaptation, i.e. model including locus-specific effect alpha. Fifth column is estimated locus-specific effect alpha (alpha) which indicates the strength and direction (positive values indicate diversifying selection). Final column is the locus-specific Fst coefficient averaged over populations (fst).

Use q-value to determine identify loci putatively under the influence of selection.

Distribution of q-values.

Distribution of q-values.

Number of loci with q-value < 0.05

qval <= 0.05 n
FALSE 5539
TRUE 15

Number of loci with q-value < 0.01

qval <= 0.01 n
FALSE 5541
TRUE 13

Number of loci with q-value < 0.001

qval <= 0.001 n
FALSE 5542
TRUE 12

Distribution of estimated Fst and locus-specific alpha component

Distribution of estimated Fst and locus-specific alpha component.

Distribution of estimated Fst and locus-specific alpha component.

Identify outlier loci

Identify loci with q-value < 0.05.

Relationship log10(qvalue) and Fst per locus

Relationship log10(qvalue) and Fst per locus

A total of 15 Fst-outlier loci were identified using bayescan (p < 0.05).

3.2 Model-based estimation of null distribution: Coalescent simulations (island model)

The FDIST method (Beaumont and Nichols 1996Beaumont, M. A., and R. A. Nichols. 1996. “Evaluating loci for use in the genetic analysis of population structure.” Proceedings of the Royal Society B: Biological Sciences 263 (1377): 1619–26. https://doi.org/10.1098/rspb.1996.0237.; Excoffier and Lischer 2010Excoffier, Laurent, and Heidi E. L. Lischer. 2010. “Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows.” Molecular Ecology Resources 10 (3): 564–67. https://doi.org/10.1111/j.1755-0998.2010.02847.x.) is implemented in Arlequin using a strict island model. Coalescent simulations basd on the distribution of heterozygosity in the empirical data set are used to create null distributions for Fst for a given number of demes.

Run FDIST method implmented in Arlequin

Use PGDSpider to convert genepop to arlequin (*.arp) files. Need to define Structure of demes in order for Arlequin to run. Assume all in one group. Use Arlequin to create settings files; make sure linux line endings and *.ars settings files are set as executable.

Run all models for 20,000 simulations. For island model simulate 3, 4, 9, 25, 50, and 100 demes. Execute bash script to run different combinations of input files and settings files and move/rename output files into results directory.


./scr/runFDIST_Arlequin.sh

Evaluate simulated and observed Fst-He distributions

Fst-heterozygosity distribution for simulated and observed data set using nine demes (number of putative populations assumed in data set). Color of individual points depict levels significance (darker colors indicate smaller values).

Fst-heterozygosity distribution for simulated and observed data set using nine demes (number of putative populations assumed in data set). Color of individual points depict levels significance (darker colors indicate smaller values).

Identify outlier loci potentially under selection and compare patterns of variance

Identify loci with corrected p-value < 0.05.

A total of 48 Fst-outlier loci were identified using arlequin (p < 0.05).

3.3 Compare identified Fst-outlier across methods

Identify sets of loci shared between methods.

Comparison of set sizes per outlier method and intersecting sets of loci.

Comparison of set sizes per outlier method and intersecting sets of loci.

4 Genetic-environmental association analysis

4.1 Spatial analysis: RDA using spatial information

Redundancy analysis is an ordination method to determine how much variation of one set of variables (explanatory variables) explains the variation in another set of variables (response variables), i.e. it is the multivariate analog of a simple linear regression. It assumes that the relationship between the dependent (response) and independent (explanatory) variables is linear.

Here, it can be applied to understand which geographic and/or environmental factors explain the overall observed pattern of genetic variation on the landscape (Oksanen et al. 2010Oksanen, Jari, Roeland Kindt, Pierre Legendre, Bob O Hara, Gavin L Simpson, Petery Solymos, M Henry H Stevens, and Helene Wagner. 2010. “vegan: Community Ecology Package. R package version 1.17-3.” October 10 (01): 2008. http:/.; Forester et al. 2016Forester, Brenna R., Matthew R. Jones, Stéphane Joost, Erin L. Landguth, and Jesse R. Lasky. 2016. “Detecting spatial genetic signatures of local adaptation in heterogeneous landscapes.” Molecular Ecology 25 (1): 104–20. https://doi.org/10.1111/mec.13476., 2018Forester, Brenna R., Jesse R. Lasky, Helene H. Wagner, and Dean L. Urban. 2018. “Comparing methods for detecting multilocus adaptation with multivariate genotype–environment associations.” Molecular Ecology 27 (9): 2215–33. https://doi.org/10.1111/mec.14584.).

Response variables: Genotypes

Format response variables (genetic data) as allele counts per locus and individual, homozygous calls are coded as 0 or 2, heterozygotes as 1. RDA requires a complete data set, therefore missing values are replaced with mean allele frequency.

All alleles are coded as 0, 1, or 2 therefore, scaling/centering should not be necessary.

setPop(gen) <- ~POP

# replace missing data
nomissing <- missingno(gen, type = "mean")

# extract allele counts
allelecount <- as.data.frame(tab(nomissing))

The repsonse variables consist of 18606 loci genotyped for 367 individuals across 9 samples estuaries.

Explanatory variables: Moran’s Eigenvector Maps (MEMs)

Calculation of spatial eigenvector maps requires a distance matrix. Use great circle distance calculated using lat/long information for individual samples. Use this matrix to calculate db-MEMs. Keep default setting for truncation, i.e. the length of the longest edge of the minimum spanning tree will be used as the threshold. Only positive MEMs are retained.

MEMs are orthogonal vectors with a unit norm that maximizes spatial autocorrelation (Griffith 1996Griffith, Daniel A. 1996. “Spatial autocorrelation and eigenfunctions of the geographic weights matrix accompanying geo-referenced data.” Canadian Geographer 40 (4): 351–67. https://doi.org/10.1111/j.1541-0064.1996.tb00462.x.; Dray et al. 2012Dray, S., R. Pélissier, P. Couteron, M. J. Fortin, P. Legendre, P. R. Peres-Neto, E. Bellier, et al. 2012. “Community ecology in the age of multivariate multiscale spatial analysis.” John Wiley & Sons, Ltd. https://doi.org/10.1890/11-1183.1.).

This gives us 6 MEMs.

Plot them using adegraphics

Visualize the MEM values - 1st MEMs are large spatial scales, spatial scales increasingly smaller for smaller MEMs (this is independent of genetic patterns; purely spatial/sampling design).

Fig 2: dbMEM values per samples. Positive values are blue, negative values are yellow, the size of the circles are scaled by magnitude.

Fig 2: dbMEM values per samples. Positive values are blue, negative values are yellow, the size of the circles are scaled by magnitude.

Model selection

Select best model for spatial variables using Forward model selection using permutation tests on rda object, a model is defined and a given scope of models considered. The function alternates with drop and add steps and stops when model is not changed during one step.

# run initial rda
rda.xy = rda(allelecount ~ ., data.frame(dbMEM), scale = FALSE)

stp.xy = ordiR2step(rda(allelecount ~ 1, data.frame(dbMEM)), 
                   scope = formula(rda.xy), 
                   R2scope = TRUE,
                   scale = FALSE, 
                   Pin = 0.05,
                   Pout = 0.1,
                   direction="both", 
                   R2permutations = 999,
                   parallel = 45)
## Step: R2.adj= 0 
## Call: allelecount ~ 1 
##  
##                    R2.adjusted
## <All variables>  0.00794247207
## + MEM1           0.00536696607
## + MEM3           0.00108849267
## + MEM2           0.00094741207
## + MEM4           0.00036982618
## + MEM5           0.00013556168
## <none>           0.00000000000
## + MEM6          -0.00007458758
## 
##        Df    AIC      F Pr(>F)   
## + MEM1  1 2956.5 2.9749  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.005366966 
## Call: allelecount ~ MEM1 
##  
##                 R2.adjusted
## <All variables> 0.007942472
## + MEM3          0.006473194
## + MEM2          0.006331725
## + MEM4          0.005752553
## + MEM5          0.005517645
## <none>          0.005366966
## + MEM6          0.005306918
## 
##        Df    AIC      F Pr(>F)   
## + MEM3  1 2957.1 1.4064  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.006473194 
## Call: allelecount ~ MEM1 + MEM3 
##  
##                 R2.adjusted
## <All variables> 0.007942472
## + MEM2          0.007443658
## + MEM4          0.006862890
## + MEM5          0.006627335
## <none>          0.006473194
## + MEM6          0.006416027
## 
##        Df    AIC      F Pr(>F)   
## + MEM2  1 2957.8 1.3559  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.007443658 
## Call: allelecount ~ MEM1 + MEM3 + MEM2 
##  
##                 R2.adjusted
## <All variables> 0.007942472
## + MEM4          0.007837112
## + MEM5          0.007600906
## <none>          0.007443658
## + MEM6          0.007389015
## 
##        Df    AIC     F Pr(>F)   
## + MEM4  1 2958.6 1.144  0.002 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step: R2.adj= 0.007837112 
## Call: allelecount ~ MEM1 + MEM3 + MEM2 + MEM4 
##  
##                 R2.adjusted
## + MEM5          0.007995885
## <All variables> 0.007942472
## <none>          0.007837112
## + MEM6          0.007783407

Run RDA

The selected model consists of MEM1, MEM3, MEM2, and MEM4.

xy_selected <- dbMEM %>%
  as.data.frame() %>%
  select(one_of(attributes(stp.xy$terms)$term.labels))

rda.xy <- rda(allelecount ~ ., data = xy_selected, scale = FALSE)

Determine proportion of genetic variance explained by geographic data by running RDA using the selected spatial variables as the explanatory variables, i.e. fitting linear trend surface (accounting for linear geographic scale).

## Call: rda(formula = allelecount ~ MEM1 + MEM3 + MEM2 + MEM4, data =
## xy_selected, scale = FALSE)
## 
##                  Inertia Proportion Rank
## Total         3152.38816    1.00000     
## Constrained     58.88794    0.01868    4
## Unconstrained 3093.50022    0.98132  362
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##   RDA1   RDA2   RDA3   RDA4 
## 28.865 12.301  9.112  8.610 
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 33.03 26.05 13.72 13.66 13.19 13.16 13.07 13.04 
## (Showing 8 of 362 unconstrained eigenvalues)

Test for significance using a permutation test to generate p-value indicating whether to reflect null hypothesis (geography does not explain genetic variation). Rows of unconstrained matrix (genetic data set) are repeatedly randomized for n permutations. Relationship is considered significant if observed relationship is stronger than the randomly permuted relationships.

anova.cca(rda.xy, 
          permutations = 1000,
          parallel = 45)
## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 1000
## 
## Model: rda(formula = allelecount ~ MEM1 + MEM3 + MEM2 + MEM4, data = xy_selected, scale = FALSE)
##           Df Variance      F   Pr(>F)    
## Model      4    58.89 1.7228 0.000999 ***
## Residual 362  3093.50                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Proportion of variance explained by the environmental predictors is Proportion column of Constrained row in summary table (equivalent to R2 of multiple regression).

R2 (0.0187) needs to be adjusted based on the number of predictors, as the number of explanatory variables inflates the apparent amount of explained variance due to random correlation. The adjusted R2 values is 0.0078

## Importance of components:
##                          RDA1    RDA2   RDA3   RDA4
## Eigenvalue            28.8651 12.3006 9.1121 8.6101
## Proportion Explained   0.4902  0.2089 0.1547 0.1462
## Cumulative Proportion  0.4902  0.6991 0.8538 1.0000

Fig 3: Eigenvalues for the constrained axes reflect the variance explained by each canonical axis

Fig 3: Eigenvalues for the constrained axes reflect the variance explained by each canonical axis

Compare constraints (explanatory variables).

Fig 4: Biplot of constraints.

Fig 4: Biplot of constraints.

Use Mahalanobis distance to identify alleles with strongest association to first three RDA axis.

Fig 5: Distribution of alleles; alleles from loci with at least one allele flagged as non-associated (grey) vs outlier (red) based on Mahalanobis distance threshold of 18.46 for four degrees of freedom (p < 0.001).

Fig 5: Distribution of alleles; alleles from loci with at least one allele flagged as non-associated (grey) vs outlier (red) based on Mahalanobis distance threshold of 18.46 for four degrees of freedom (p < 0.001).

Table 1a: Number of loci significantly associated allele (for bivariate distribution critical value = 13.82 (p < 0.001))

MAH_DIST >= 18.46 n
FALSE 5166
TRUE 388

Table 1b: Number of loci significantly associated allele (for bivariate distribution critical value = 25 (p < 0.001))

MAH_DIST >= 25 n
FALSE 5396
TRUE 158

Compare clustering of individual based on weighted average individual scores, i.e. weighted averages of allele scores that are as similar to linear combination scores as possible. Weights are individual totals of alleles. To determine how well explanatory variables separate groups of individuals or if explanatory variables can be used to discriminate between groups of individuals, use wa-scores.

Fig 6: RDA biplot indicating spatial variables (arrows) and clustering of individuals on RDA1-3.

Fig 6: RDA biplot indicating spatial variables (arrows) and clustering of individuals on RDA1-3.

4.2 Environmental analysis

Response variables: Genotypes

Format response variables (genetic data) as allele counts per locus and individual, homozygous calls are coded as 0 or 2, heterozygotes as 1. RDA requires a complete data set, therefore missing values are replaced with mean allele frequency.

All alleles are coded as 0, 1, or 2 therefore, scaling/centering should not be necessary.

setPop(gen) <- ~POP

# replace missing data
nomissing <- missingno(gen, type = "mean")

# extract allele counts
allelecount <- as.data.frame(tab(nomissing))

The repsonse variables consist of 18606 loci genotyped for 367 individuals across 9 samples estuaries.

Explanatory variables: Bioclim variables

Download & Format environmental parameters

# access marine data sets
marine <- list_datasets(terrestrial = FALSE, marine = TRUE)

# identify available layers & break into sets
layers <- list_layers(marine) %>%
  separate(layer_code, into = c("dataset", "parameter", "stat"), extra = "merge", remove = FALSE) %>%
  filter(!dataset == "BO2") %>%
  mutate(SET = case_when(dataset == "MS" ~ "MARSPEC",
                         dataset == "BO" ~ "BIO_CLIM",
                         grepl("phytoplankton", description, ignore.case = TRUE) ~ "carb_phytopl",
                         grepl("Chlorophyll", description, ignore.case = TRUE) ~ "chlorophyll",
                         grepl("velocity", description, ignore.case = TRUE) ~ "velocity",
                         grepl("oxygen", description, ignore.case = TRUE) ~ "DO",
                         grepl("Iron", description, ignore.case = TRUE) ~ "Iron",
                         grepl("Light", description, ignore.case = TRUE) ~ "Light",
                         grepl("Nitrate", description, ignore.case = TRUE) ~ "Nitrate",
                         grepl("Phosphate", description, ignore.case = TRUE) ~ "Phosphate",
                         grepl("Primary", description, ignore.case = TRUE) ~ "PrimaryProd",
                         grepl("ice", description, ignore.case = TRUE) ~ "SeaIce",
                         grepl("surface", description, ignore.case = TRUE) ~ "SeaSurface",
                         grepl("water", description, ignore.case = TRUE) ~ "SeaWater",
                         grepl("Silicate", description, ignore.case = TRUE) ~ "Silicate"))

Loop through layers to download and extract values for coordinates.

# vector with sets
set <- unique(layers$SET)

set <- set[1:2]

for(s in set){
  
  cat(s,"start download ...")
  
  # data frame for results
  env <- strata %>%
    dplyr::select(LIB_ID, POP, LAT, LONG)
  
  # subset layers
  layer_subset <- layers %>%
    filter(SET == s)
  
  # filename
  path <- glue("data/POPGEN/{s}_env.param")

  # download all the layers in the set
  for(l in layer_subset$layer_code){

    # download layer
    env_layer <- load_layers(l)

    # coordinates
    xy <- xy

    # extract values for xy coordinates
    param <- raster::extract(x = env_layer,
                     y = xy,
                     method = "bilinear") %>%
        as.data.frame() %>%
        magrittr::set_names(l)

    # combine results
    env <- env %>%
      bind_cols(param)
    
    # write to file
    write_delim(env, path, delim = "\t")

    }
  
}

Load variables for all sets of variables.

Model selection

We are going to use only the original BIO_CLIM and MARSPEC data sets.

Table 2: MARSPEC and BIO-CLIM variables used for model selection.

layer_code description
BO_calcite Calcite concentration indicates the mean concentration of calcite (CaCO3) in oceans.
BO_chlomax Chlorophyll A concentration indicates the concentration of photosynthetic pigment chlorophyll A (the most common green chlorophyll) in oceans. Please note that in shallow water these values may reflect any kind of autotrophic biomass.
BO_chlomean Chlorophyll A concentration indicates the concentration of photosynthetic pigment chlorophyll A (the most common green chlorophyll) in oceans. Please note that in shallow water these values may reflect any kind of autotrophic biomass.
BO_chlomin Chlorophyll A concentration indicates the concentration of photosynthetic pigment chlorophyll A (the most common green chlorophyll) in oceans. Please note that in shallow water these values may reflect any kind of autotrophic biomass.
BO_chlorange Chlorophyll A concentration indicates the concentration of photosynthetic pigment chlorophyll A (the most common green chlorophyll) in oceans. Please note that in shallow water these values may reflect any kind of autotrophic biomass.
BO_cloudmean Cloud fraction indicates how much of the earth is covered by clouds.
BO_cloudmin Cloud fraction indicates how much of the earth is covered by clouds.
BO_damax The diffuse attenuation coefficient is an indicator of water clarity. It expresses how deeply visible light in the blue to the green region of the spectrum penetrates in to the water column.
BO_damean The diffuse attenuation coefficient is an indicator of water clarity. It expresses how deeply visible light in the blue to the green region of the spectrum penetrates in to the water column.
BO_damin The diffuse attenuation coefficient is an indicator of water clarity. It expresses how deeply visible light in the blue to the green region of the spectrum penetrates in to the water column.
BO_dissox Dissolved oxygen concentration [02]
BO_nitrate This layer contains both [NO3] and [NO3+NO2] data. By this we mean chemically reactive dissolved inorganic nitrate and nitrate or nitrite. (It is important to note that data reported as [NO3] in the WOD09 should be used with caution because it is difficult to verify that the [NO3] (nitrate) data are [NO3+NO2] or [NO3]. (Boyer et al. 2009))
BO_parmax Photosynthetically Available Radiation (PAR) indicates the quantum energy flux from the Sun (in the spectral range 400-700 nm) reaching the ocean surface.
BO_parmean Photosynthetically Available Radiation (PAR) indicates the quantum energy flux from the Sun (in the spectral range 400-700 nm) reaching the ocean surface.
BO_ph Measure of acidity in the ocean.
BO_phosphate Reactive ortho-phosphate concentration [HPO4^-2] in the ocean.
BO_salinity Salinity indicates the dissolved salt content in the ocean.
BO_silicate This variable indicates the concentration of silicate or ortho-silicic acid [Si(OH)4] in the ocean
BO_sstmax Sea surface temperature is the temperature of the water at the ocean surface. This parameter indicates the temperature of the topmost meter of the ocean water column.
BO_sstmean Sea surface temperature is the temperature of the water at the ocean surface. This parameter indicates the temperature of the topmost meter of the ocean water column.
BO_sstmin Sea surface temperature is the temperature of the water at the ocean surface. This parameter indicates the temperature of the topmost meter of the ocean water column.
BO_sstrange Sea surface temperature is the temperature of the water at the ocean surface. This parameter indicates the temperature of the topmost meter of the ocean water column.
BO_bathymin Minimum depth of the seafloor
BO_bathymax Maximum depth of the seafloor
BO_bathymean Average depth of the seafloor
MS_bathy_5m Depth of the seafloor
MS_biogeo01_aspect_EW_5m East/West Aspect (sin(aspect in radians))
MS_biogeo02_aspect_NS_5m North/South Aspect (cos(aspect in radians))
MS_biogeo03_plan_curvature_5m Plan curvature is the curvature in the direction perpendicular to the maximum slope and indicates whether flow across a surface would diverge (positive values) or converge (negative values).
MS_biogeo04_profile_curvature_5m Profile curvature is the curvature in the direction parallel to the maximum slope and indicates whether flow across a surface would accelerate (positive values) or decelerate (negative values).
MS_biogeo05_dist_shore_5m ’’
MS_biogeo06_bathy_slope_5m Bathymetric slope was measured in degrees ranging from 0??? (flat surface) to 90??? (vertical slope).
MS_biogeo07_concavity_5m Concavity is the second derivative of the bathymetry layer (or the slope of the slope) and represents whether a raster cell is on a hill (negative values) or in a valley (positive values).
MS_biogeo08_sss_mean_5m Measurements of sea surface salinity (SSS) were obtained from in situ oceanographic observations compiled by NOAA?s World Ocean Atlas 2009 (WOA09, Antonov et al. 2010).
MS_biogeo09_sss_min_5m Measurements of sea surface salinity (SSS) were obtained from in situ oceanographic observations compiled by NOAA?s World Ocean Atlas 2009 (WOA09, Antonov et al. 2010).
MS_biogeo10_sss_max_5m Measurements of sea surface salinity (SSS) were obtained from in situ oceanographic observations compiled by NOAA?s World Ocean Atlas 2009 (WOA09, Antonov et al. 2010).
MS_biogeo11_sss_range_5m Measurements of sea surface salinity (SSS) were obtained from in situ oceanographic observations compiled by NOAA?s World Ocean Atlas 2009 (WOA09, Antonov et al. 2010).
MS_biogeo12_sss_variance_5m Measurements of sea surface salinity (SSS) were obtained from in situ oceanographic observations compiled by NOAA?s World Ocean Atlas 2009 (WOA09, Antonov et al. 2010).
MS_biogeo13_sst_mean_5m Satellite measures of sea surface temperature (SST) were obtained at a 2.5 arc-minute resolution (approximately 4 km_) from Aqua-MODIS 4-micron nighttime SST Level 3 standard mapped image products, downloaded from NASA’s Ocean Color website (http://oceancolor.gsfc.nasa.gov/).
MS_biogeo14_sst_min_5m Satellite measures of sea surface temperature (SST) were obtained at a 2.5 arc-minute resolution (approximately 4 km_) from Aqua-MODIS 4-micron nighttime SST Level 3 standard mapped image products, downloaded from NASA’s Ocean Color website (http://oceancolor.gsfc.nasa.gov/).
MS_biogeo15_sst_max_5m Satellite measures of sea surface temperature (SST) were obtained at a 2.5 arc-minute resolution (approximately 4 km_) from Aqua-MODIS 4-micron nighttime SST Level 3 standard mapped image products, downloaded from NASA’s Ocean Color website (http://oceancolor.gsfc.nasa.gov/).
MS_biogeo16_sst_range_5m Satellite measures of sea surface temperature (SST) were obtained at a 2.5 arc-minute resolution (approximately 4 km_) from Aqua-MODIS 4-micron nighttime SST Level 3 standard mapped image products, downloaded from NASA’s Ocean Color website (http://oceancolor.gsfc.nasa.gov/).
MS_biogeo17_sst_variance_5m Satellite measures of sea surface temperature (SST) were obtained at a 2.5 arc-minute resolution (approximately 4 km_) from Aqua-MODIS 4-micron nighttime SST Level 3 standard mapped image products, downloaded from NASA’s Ocean Color website (http://oceancolor.gsfc.nasa.gov/).
MS_sss01_5m Average sea surface salinity for the month january.
MS_sss02_5m Average sea surface salinity for the month february.
MS_sss03_5m Average sea surface salinity for the month march
MS_sss04_5m Average sea surface salinity for the month april.
MS_sss05_5m Average sea surface salinity for the month may.
MS_sss06_5m Average sea surface salinity for the month june.
MS_sss07_5m Average sea surface salinity for the month july.
MS_sss08_5m Average sea surface salinity for the month august.
MS_sss09_5m Average sea surface salinity for the month september.
MS_sss10_5m Average sea surface salinity for the month october.
MS_sss11_5m Average sea surface salinity for the month november.
MS_sss12_5m Average sea surface salinity for the month december.
MS_sst01_5m Average sea surface temperature for the month january.
MS_sst02_5m Average sea surface temperature for the month february.
MS_sst03_5m Average sea surface temperature for the month march
MS_sst04_5m Average sea surface temperature for the month april.
MS_sst05_5m Average sea surface temperature for the month may.
MS_sst06_5m Average sea surface temperature for the month june.
MS_sst07_5m Average sea surface temperature for the month july.
MS_sst08_5m Average sea surface temperature for the month august.
MS_sst09_5m Average sea surface temperature for the month september.
MS_sst10_5m Average sea surface temperature for the month october.
MS_sst11_5m Average sea surface temperature for the month november.
MS_sst12_5m Average sea surface temperature for the month december.

Run model selection using stepwise process.

# format data for rda
env_param <- env_param_tidy %>%
  filter(PARAMETER %in% select$layer_code) %>%
  group_by(PARAMETER) %>%
  mutate(VALUES = replace(VALUES, is.na(VALUES), mean(VALUES, na.rm=TRUE))) %>%
  pivot_wider(names_from = "PARAMETER", values_from = "VALUES") %>%
  column_to_rownames("LIB_ID")

write_delim(env_param, "data/POPGEN/bioclim.param", delim = "\t")

# run initial rda with all parameters
rda <- rda(allelecount ~ ., data = env_param, scale = TRUE)

# stepwise selection
stp.env = ordiR2step(rda(allelecount ~ 1, env_param), 
                   scope = formula(rda), 
                   R2scope = TRUE,
                   scale = FALSE, 
                   Pin = 0.05,
                   Pout = 0.1,
                   direction="both", 
                   R2permutations = 999,
                   parallel = 35)

Select best model for spatial variables using step-wise model selection based on permutation tests of rda object, a model is defined and a given scope of models considered. The function alternates with drop and add steps and stops when model is not changed during one step.

The best model consists of MS_sst06_5m, MS_bathy_5m, BO_phosphate, BO_dissox, and BO_parmean.

Distribution of values per sample location for each parameter selected for the environmental RDA model.

Distribution of values per sample location for each parameter selected for the environmental RDA model.

Table 3: Variables selected using stepwise model selection based on permutation tests of rda object for Pin = 0.05 and Pout = 0.1.

layer_code description
BO_dissox Dissolved oxygen concentration [02]
BO_parmean Photosynthetically Available Radiation (PAR) indicates the quantum energy flux from the Sun (in the spectral range 400-700 nm) reaching the ocean surface.
BO_phosphate Reactive ortho-phosphate concentration [HPO4^-2] in the ocean.
MS_bathy_5m Depth of the seafloor
MS_sst06_5m Average sea surface temperature for the month june.

Run RDA

Run RDA using selected principle components and test for significance.

env_model <-  env_param_tidy %>%
  filter(PARAMETER %in% selected) %>%
  group_by(PARAMETER) %>%
  mutate(VALUES = replace(VALUES, is.na(VALUES), mean(VALUES, na.rm=TRUE))) %>%
  pivot_wider(names_from = "PARAMETER", values_from = "VALUES") %>%
  column_to_rownames("LIB_ID")
  
rda.env <- rda(allelecount ~ ., data = env_model, scale = TRUE)

Determine the proportion of genetic variance explained by geographic data by running RDA using the selected environemtal variables as the explanatory variables.

## Call: rda(formula = allelecount ~ BO_dissox + BO_parmean + BO_phosphate
## + MS_bathy_5m + MS_sst06_5m, data = env_model, scale = TRUE)
## 
##                   Inertia  Proportion Rank
## Total         18606.00000     1.00000     
## Constrained     540.27325     0.02904    5
## Unconstrained 18065.72675     0.97096  361
## Inertia is correlations 
## 
## Eigenvalues for constrained axes:
##   RDA1   RDA2   RDA3   RDA4   RDA5 
## 188.06 157.52  87.76  57.92  49.02 
## 
## Eigenvalues for unconstrained axes:
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 93.17 87.13 83.63 82.72 81.91 80.67 80.36 80.11 
## (Showing 8 of 361 unconstrained eigenvalues)

Test for significance using a permutation test to generate p-value indicating whether to reflect the null hypothesis (environmental data does not explain genetic variation). Rows of unconstrained matrix (genetic data set) are repeatedly randomized for n permutations. Relationship is considered significant if observed relationship is stronger than the randomly permuted relationships.

anova.cca(rda.env, 
          permutations = 1000,
          parallel = 45)
## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 1000
## 
## Model: rda(formula = allelecount ~ BO_dissox + BO_parmean + BO_phosphate + MS_bathy_5m + MS_sst06_5m, data = env_model, scale = TRUE)
##           Df Variance      F   Pr(>F)    
## Model      5    540.3 2.1592 0.000999 ***
## Residual 361  18065.7                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The adjusted R2 values is 0.0156

Compare results

## Importance of components:
##                           RDA1     RDA2    RDA3    RDA4     RDA5
## Eigenvalue            188.0583 157.5159 87.7640 57.9200 49.01506
## Proportion Explained    0.3481   0.2915  0.1624  0.1072  0.09072
## Cumulative Proportion   0.3481   0.6396  0.8021  0.9093  1.00000

Fig 7: Eigenvalues for the constrained axes reflect the variance explained by each canonical axis.

Fig 7: Eigenvalues for the constrained axes reflect the variance explained by each canonical axis.

Compare constraints (explanatory variables).

Fig 8: Biplot of constraints.

Fig 8: Biplot of constraints.

Use Mahalanobis distance to identify alleles with strongest association to first two RDA axis.

Fig 9: Distribution of alleles; alleles from loci with at least one allele flagged as non-associated (grey) vs outlier (red) based on Mahalanobis distance threshold of 20.51 for p < .001, (5 degrees of freedom)

Fig 9: Distribution of alleles; alleles from loci with at least one allele flagged as non-associated (grey) vs outlier (red) based on Mahalanobis distance threshold of 20.51 for p < .001, (5 degrees of freedom)

Table 4a: Number of loci significantly associated allele (for bivariate distribution critical value = 13.82 (p < 0.001))

MAH_DIST >= 20.51 n
FALSE 5436
TRUE 118

Table 4b: Number of loci significantly associated allele.

MAH_DIST >= 25 n
FALSE 5503
TRUE 51

Compare clustering of individual based on weighted average individual scores, i.e. weighted averages of allele scores that are as similar to linear combination scores as possible. Weights are individual totals of alleles. To determine how well explanatory variables separate groups of individuals or if explanatory variables can be used to discriminate between groups of individuals, use wa-scores.

Fig 10: RDA biplot indicating environmental variables (arrows) and clustering of individuals.

Fig 10: RDA biplot indicating environmental variables (arrows) and clustering of individuals.

4.3 Partitioning of variance

Perform RDA for all variables (spatial and environmental; full model).

## Call: rda(formula = allelecount ~ xy.mat + env.mat, scale = FALSE)
## 
##                  Inertia Proportion Rank
## Total         3152.38816    1.00000     
## Constrained    138.87963    0.04406    9
## Unconstrained 3013.50853    0.95594  357
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##  RDA1  RDA2  RDA3  RDA4  RDA5  RDA6  RDA7  RDA8  RDA9 
## 40.77 27.70 18.40  9.15  8.88  8.62  8.59  8.46  8.31 
## 
## Eigenvalues for unconstrained axes:
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 13.812 13.580 13.421 13.158 13.045 13.017 12.984 12.895 
## (Showing 8 of 357 unconstrained eigenvalues)

Test for significance of overall model.

The overall model is significant (P = 0.001).

Partition the variation in genetic data into components accounted for by environmental and spatial variables and their combined effect using adjusted R squared to assess partitions explained by explanatory tables and their combinations.

## 
## Partition of variance in RDA 
## 
## Call: varpart(Y = allelecount, X = ~xy.mat, ~env.mat)
## 
## Explanatory tables:
## X1:  ~xy.mat
## X2:  ~env.mat 
## 
## No. of explanatory tables: 2 
## Total variation (SS): 1153774 
##             Variance: 3152.4 
## No. of observations: 367 
## 
## Partition table:
##                      Df R.squared Adj.R.squared Testable
## [a+b] = X1            4   0.01868       0.00784     TRUE
## [b+c] = X2            5   0.03231       0.01890     TRUE
## [a+b+c] = X1+X2       9   0.04406       0.01996     TRUE
## Individual fractions                                    
## [a] = X1|X2           4                 0.00105     TRUE
## [b]                   0                 0.00678    FALSE
## [c] = X2|X1           5                 0.01212     TRUE
## [d] = Residuals                         0.98004    FALSE
## ---
## Use function 'rda' to test significance of fractions of interest

Extract fraction explained by each component:

Test significance of all components using permuted p-values.

Variance partitioning of variance explained by db_MEMs (xy), environmental variables (env), and shared effects due to correlation of xy and env.

PARTITION VARIANCE PVAL
residuals 0.9800441 NA
xy+env+shared 0.0199559 0.001
env+shared 0.0189025 0.001
env 0.0121188 0.001
xy+shared 0.0078371 0.001
shared 0.0067837 NA
xy 0.0010534 0.001

5 Create neutral and outlier data sets

5.1 Compare sets of extracted loci

Tidy data set containing loci (contigs) flagged as Fst-outlier for each method, and loci flagged as strongly associated with first three RDA axis for spatial and environmental RDAs and identify sets of loci shared between methods.

Fig. 14: Comparison of set sizes per outlier method and intersecting sets of loci.

Fig. 14: Comparison of set sizes per outlier method and intersecting sets of loci.

Fig. 14: Comparison of set sizes per outlier method and intersecting sets of loci.

Fig. 14: Comparison of set sizes per outlier method and intersecting sets of loci.

5.2 Create data sets

Outlier loci were defined as loci identified by at least one of the Fst-outlier methods or the RDA.

The outlier data set consists of 133loci.

Create neutral data set as loci remaining after all outlier loci are removed from data set.

The neutral data set consists of 5421 loci.

6 K-means clustering and Discriminant Analysis of Principle Components

K-means clustering can be used for model-free clustering based on similarity to identify clusters of individuals. Discriminant analysis of principle components (DAPC) implemented in adegenet (Jombart, Devillard, and Balloux 2010Jombart, Thibaut, Sébastien Devillard, and François Balloux. 2010. “Discriminant analysis of principal components: a new method for the analysis of genetically structured populations.” BMC Genetics 11 (1): 94. https://doi.org/10.1186/1471-2156-11-94.) can then used to determine membership probabilities of each sample to each inferred cluster. During DAPC, the genotype matrix is first PCA-transformed, samples are paritioned into a within- and between-group component to maximize discrimination between groups using linear discriminant analysis; the discriminant functions can be expressed as linear combinations of alleles and therefore, allele contributions to the identified patterns can be computed.

6.1 Outlier loci

Use k-means clustering to identify clusters based on genetic similarity (data transformed using PCA). For clustering it is appropriate to retain all PCs (all variance in the data set), as overfitting is not an issue.

Fig 1: AIC for data clustered into K = 1 - 40 groups using k-means clustering.

Fig 1: AIC for data clustered into K = 1 - 40 groups using k-means clustering.

Minimum value for AIC is obtained for K=10.

Perform stratified cross-validation of DAPC using a range of retained PCs, while keeping the number of discriminant functions fixed for K = 2 - 5 to determine the most appropriate number of principle components to retain sufficient variance to discriminate but not overfit the data.

Retaining too few PCs may result in important variance not being retained and therefore not informing the clustering analysis, while retaining too many PCs results in overfitting, i.e. assignment success decreases.

Fig 2: Proportion of test individuals correctly assigned to their cluster of origin based on number of retained PCs for K = 2 - 9.

Fig 2: Proportion of test individuals correctly assigned to their cluster of origin based on number of retained PCs for K = 2 - 9.

The optimum number of principle components, %-variance retained and the mean optimum assignment succes for K = 2 - 10 clusters.

k_clust optPC variance mean_opt_success
2 50 70.17 1.00
3 50 47.00 0.98
4 50 47.00 0.98
5 100 70.17 0.94

Genetic data is scaled and centered, then transformed using a PCA. Retained PCs are then transmitted to a linear discriminant analysis.

Calculate membership probability of each individual to per cluster.

Fig 3: Membership probability of each individual to clusters identified using k-means hierarchical clustering.

Fig 3: Membership probability of each individual to clusters identified using k-means hierarchical clustering.

Individuals are grouped within sample locations in individuals panels as FLA, FLGS, FLGN, MB, MISS, CS, LA, CC, and CAMP.

6.2 Neutral loci (Fst-outlier & environmentally-associated loci removed)

Use k-means clustering to identify clusters based on genetic similarity (data transformed using PCA). For clustering it is appropriate to retain all PCs (all variance in the data set), as overfitting is not an issue.

Fig 1: AIC for data clustered into K = 1 - 40 groups using k-means clustering.

Fig 1: AIC for data clustered into K = 1 - 40 groups using k-means clustering.

Minimum value for AIC is obtained for K=4.

Perform stratified cross-validation of DAPC using a range of retained PCs, while keeping the number of discriminant functions fixed for K = 2 - 5 to determine the most appropriate number of principle components to retain sufficient variance to discriminate but not overfit the data.

Retaining too few PCs may result in important variance not being retained and therefore not informing the clustering analysis, while retaining too many PCs results in overfitting, i.e. assignment success decreases.

Fig 2: Proportion of test individuals correctly assigned to their cluster of origin based on number of retained PCs for K = 2 - 9.

Fig 2: Proportion of test individuals correctly assigned to their cluster of origin based on number of retained PCs for K = 2 - 9.

The optimum number of principle components, %-variance retained and the mean optimum assignment succes for K = 2 - 10 clusters.

k_clust optPC variance mean_opt_success
2 50 64.01 1.00
3 50 36.20 1.00
4 50 20.25 1.00
5 150 50.72 0.94

Genetic data is scaled and centered, then transformed using a PCA. Retained PCs are then transmitted to a linear discriminant analysis.

Calculate membership probability of each individual to per cluster.

Fig 3: Membership probability of each individual to clusters identified using k-means hierarchical clustering.

Fig 3: Membership probability of each individual to clusters identified using k-means hierarchical clustering.

Individuals are grouped within sample locations in individuals panels as FLA, FLGS, FLGN, MB, MISS, CS, LA, CC, and CAMP.

7 Fst

7.1 AMOVA

Implemented in Arlequin

7.2 Pairwise Fst

Pairwise FST (Weir and Cockerham 1984Weir, B. S., and C. Clark Cockerham. 1984. “Estimating F-Statistics for the Analysis of Population Structure.” Evolution 38 (6): 1358. https://doi.org/10.2307/2408641.) was calculated as a post hoc test for significant differences between oceans, among geographic regions, and estuaries. Significance determined using permuted p-values (1,000 iterations) and 95% confidence intervals around each estimate generated by resampling loci 10,000 times with replacement using functions implemented in hierfstat and assigner.

PAIRWISE COMPARISON BETWEEN OCEAN BASINS

Pairwise Fst among ocean basins.

POP1 POP2 N_MARKERS FST CI_LOW CI_HIGH PVAL
GULF ATL 5554 0.027023 0.0257106 0.0283911 0.000999

Atlantic and Gulf samples are significantly different from each other.

PAIRWISE COMPARISONS AMONG GEOGRAPHIC REGIONS

Pairwise Fst among geographic regions. P-values and 95%-CI recored in table.

Pairwise Fst among geographic regions. P-values and 95%-CI recored in table.
POP1 POP2 N_MARKERS FST CI_LOW CI_HIGH PVAL
SGULF SWATL 5360 0.0512467 0.0489500 0.0536600 0.0009990
SWATL WGULF 5408 0.0298616 0.0282982 0.0314237 0.0009990
SWATL CGULF 5499 0.0288954 0.0274644 0.0303701 0.0009990
EGULF SWATL 5437 0.0267737 0.0254036 0.0281738 0.0009990
SGULF EGULF 5494 0.0258652 0.0244581 0.0272891 0.0009990
SGULF CGULF 5520 0.0248761 0.0233697 0.0264718 0.0009990
SGULF WGULF 5465 0.0227486 0.0212929 0.0242021 0.0009990
EGULF CGULF 5523 0.0051954 0.0047239 0.0056983 0.0009990
EGULF WGULF 5486 0.0051723 0.0046221 0.0057346 0.0009990
CGULF WGULF 5517 0.0001506 0.0000000 0.0004078 0.2107892

With the exception of the central and western Gulf, all geographic regions are significantly different from each other.

PAIRWISE COMPARISONS AMONG SAMPLE LOCATIONS

Pairwise Fst among sample locations. P-values and 95%-CI recored in table.

Pairwise Fst among sample locations. P-values and 95%-CI recored in table.
POP1 POP2 N_MARKERS FST CI_LOW CI_HIGH PVAL
CAMP FLA 5360 0.0512467 0.0488922 0.0536464 0.0009990
FLA CS 5184 0.0306518 0.0288356 0.0324278 0.0009990
FLA LA 5368 0.0303556 0.0287824 0.0319644 0.0009990
FLA CC 5408 0.0298616 0.0283038 0.0314266 0.0009990
FLA MISS 5335 0.0297276 0.0281899 0.0313440 0.0009990
FLA MB 5369 0.0285740 0.0270575 0.0301262 0.0009990
FLGS FLA 5352 0.0273936 0.0259640 0.0288965 0.0009990
FLGN FLA 5324 0.0269153 0.0254449 0.0284279 0.0009990
CAMP FLGS 5456 0.0264596 0.0249415 0.0280057 0.0009990
CAMP FLGN 5434 0.0251159 0.0236080 0.0265965 0.0009990
CAMP LA 5451 0.0250990 0.0234492 0.0268720 0.0009990
CAMP MB 5455 0.0249516 0.0233432 0.0265992 0.0009990
CAMP MISS 5426 0.0248768 0.0231738 0.0266535 0.0009990
CAMP CS 5361 0.0242077 0.0223701 0.0260826 0.0009990
CAMP CC 5465 0.0227486 0.0212690 0.0242935 0.0009990
FLGS LA 5444 0.0064339 0.0057135 0.0071633 0.0009990
FLGN LA 5426 0.0058392 0.0051129 0.0066202 0.0009990
FLGS CC 5454 0.0056693 0.0050236 0.0063266 0.0009990
FLGS MISS 5410 0.0055834 0.0048907 0.0063107 0.0009990
FLGS MB 5427 0.0053704 0.0047443 0.0060005 0.0009990
FLGS CS 5349 0.0054568 0.0044884 0.0064636 0.0009990
FLGN CC 5434 0.0047991 0.0041527 0.0054663 0.0009990
FLGN MISS 5394 0.0048103 0.0040661 0.0055689 0.0009990
FLGN CS 5330 0.0049619 0.0040366 0.0059292 0.0009990
FLGN MB 5415 0.0044797 0.0038322 0.0051598 0.0009990
LA MB 5424 0.0006684 0.0002496 0.0010927 0.0219780
CS LA 5353 0.0008099 0.0000898 0.0015353 0.0579421
CS MISS 5327 0.0007802 0.0000203 0.0015433 0.0669331
FLGS FLGN 5388 0.0002916 0.0000000 0.0007169 0.1668332
CS CC 5390 0.0005656 0.0000000 0.0012437 0.1098901
CS MB 5342 0.0005362 0.0000000 0.0012269 0.1228771
CC LA 5443 0.0003122 0.0000000 0.0007022 0.1228771
CC MISS 5443 0.0002717 0.0000000 0.0007181 0.1718282
CC MB 5458 0.0003393 0.0000000 0.0007430 0.0999001
LA MISS 5405 0.0004550 0.0000000 0.0009269 0.0929071
MISS MB 5420 0.0000070 0.0000000 0.0004469 0.4725275

The Campeche and Atlantic sample location exhibit the strongest differentiation, followed by all other Atlantic-Gulf comparisons. All pairwise comparisons of locations in the northern Gulf are significantly different from the southern Gulf sample location. Within the northern Gulf, the northern and southern Florida samples are significantly different from sample locations in the central and western sample location but are not different from each other (within eastern Gulf region), similarly most pairwise comparisons among sample locations within the central and western Gulf regions are not significantly different from each other.

8 Comparison of genetic diversity all loci

8.1 Calculate measures of diversity

Calculate measures of genetic diversity based on allele frequencies.

# define groups to analyze ====
setPop(gen) <- ~OCEAN

gen_oce <- seppop(gen)

setPop(gen) <- ~REGION

gen_reg <- seppop(gen)

setPop(gen) <- ~POP

gen_pop <- seppop(gen)

gen_grp <- c(gen_oce, gen_reg, gen_pop)

gen_grp[["ALL"]] <- gen

# calculate diversity stats ====
loc_stats <- list()

for (p in names(gen_grp)) {
  
locA <- locus_table(gen_grp[[p]], index = "shannon") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

locB <- locus_table(gen_grp[[p]], index = "simpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

temp <- left_join(locA, locB)

locC <- locus_table(gen_grp[[p]], index = "invsimpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

loc_stats[[p]] <- left_join(temp, locC)
  
}

loc_stats <- ldply(loc_stats, data.frame) %>%
  select(-Hexp) %>%
  rename(GRP = `.id`,
         SIMPSON_IDX = `X1.D`,
         N_ALLELES = allele,
         SHANNON_IDX = H,
         STODD_TAYLOR_IDX = G,
         EVENNESS = Evenness)

# calculate genetic diversity stats (heterozygosity-based) ====
loc_stats_2 <- list()

for (p in names(gen_grp)) {
  
dat <- genind2hierfstat(gen_grp[[p]])
stats <- basic.stats(dat)

loc_stats_2[[p]] <- stats$perloc %>%
  rownames_to_column("LOCUS")

}

# combine into single data frame ====
loc_stats_2 <- ldply(loc_stats_2, data.frame) %>%
  rename(GRP = `.id`)

loc_stats <- left_join(loc_stats, loc_stats_2) %>%
  select(GRP, LOCUS, N_ALLELES, EVENNESS, Ho, Hs, Ht, Fis, SHANNON_IDX, SIMPSON_IDX, STODD_TAYLOR_IDX)

loc_stats[is.na(loc_stats)] <- NA

write_delim(loc_stats, "results/gendiv.locstats", delim = "\t")

8.2 Expected heterozygosity (Nei’s gene diversity).

Hs calculated as within population gene diversity (expected heterozygosity) according to (Nei 1987Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.).

Distribution of expected heterozygosity per estuary

Distribution of expected heterozygosity per estuary

Median, mean +/- std gene diversity by sample location.

GRP median mean std
FLA 0.2446 0.2714 0.2185
FLGS 0.2529 0.2822 0.2109
FLGN 0.2530 0.2834 0.2111
MB 0.2558 0.2840 0.2080
MISS 0.2508 0.2825 0.2104
CS 0.2605 0.2833 0.2170
LA 0.2522 0.2828 0.2095
CC 0.2526 0.2828 0.2077
CAMP 0.2559 0.2856 0.2099

Median, mean +/- std gene diversity by region.

REGION median mean std
SWATL 0.2446 0.2714 0.2185
EGULF 0.2530 0.2828 0.2110
CGULF 0.2554 0.2831 0.2112
WGULF 0.2526 0.2828 0.2077
SGULF 0.2559 0.2856 0.2099

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  Hs and GRP and LOCUS
## Friedman chi-squared = 294.88, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results pairwise comparisons using Wilcoxon signed rank test (p-value printed).

Results pairwise comparisons using Wilcoxon signed rank test (p-value printed).

8.3 Allelic richness

Calculate allelic richness corrected for sample size using rarefaction.

Compare levels rarefied allele counts among estuaries.

Distribution of allelic richness measured as rarefied allele counts per estuary

Distribution of allelic richness measured as rarefied allele counts per estuary

Median, mean +/- std rarefied allelic richness by sample location.

pop median mean std
FLA 2.0000 2.2822 0.9950
FLGS 2.0000 2.4670 1.0675
FLGN 2.0000 2.4749 1.0711
MB 2.0038 2.4961 1.0724
MISS 2.0000 2.4921 1.0852
CS 2.0000 2.4712 1.1223
LA 2.0000 2.4905 1.0755
CC 2.0242 2.5004 1.0680
CAMP 2.0000 2.5967 1.1897

Median, mean +/- std rarefied alllelic richness by region.

REGION median mean std
SWATL 2.0000 2.2822 0.9950
EGULF 2.0000 2.4710 1.0693
CGULF 2.0000 2.4875 1.0890
WGULF 2.0242 2.5004 1.0680
SGULF 2.0000 2.5967 1.1897

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  AR and pop and LOCUS
## Friedman chi-squared = 1928.9, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results of pairwise comparison of allelic richness among estuaries using Wilcoxon signed rank test (p-value).

Results of pairwise comparison of allelic richness among estuaries using Wilcoxon signed rank test (p-value).

8.4 Evenness

Evenness is calculated as the ratio of the number of abundant genotypes to the number of rarer genotypes calculated using the ratio of Stoddart & Tayolor index (diversity index weighted for more abundant alleles) & Shannon-Wiener index (diversity index weighted for more rare alleles).

Distribution of evenness of allelic diversity per estuary

Distribution of evenness of allelic diversity per estuary

Median, mean +/- std evenness of allele frequencies by sample location.

GRP median mean std
FLA 0.6119 0.6328 0.1820
FLGS 0.5696 0.6029 0.1736
FLGN 0.5769 0.6083 0.1724
MB 0.5669 0.6005 0.1721
MISS 0.5727 0.6062 0.1710
CS 0.6119 0.6343 0.1686
LA 0.5667 0.6004 0.1724
CC 0.5650 0.5976 0.1711
CAMP 0.5650 0.6042 0.1628

Median, mean +/- std evenness of allele frequencies by region.

REGION median mean std
SWATL 0.6119 0.6328 0.1820
EGULF 0.5732 0.6056 0.1730
CGULF 0.5783 0.6099 0.1716
WGULF 0.5650 0.5976 0.1711
SGULF 0.5650 0.6042 0.1628

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  EVENNESS and GRP and LOCUS
## Friedman chi-squared = 225.98, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results of pairwise comparisons of evenness of allelic diversity between estuaries (p-values printed).

Results of pairwise comparisons of evenness of allelic diversity between estuaries (p-values printed).

9 Comparison of genetic diversity outlier loci

9.1 Calculate measures of diversity

Calculate measures of genetic diversity based on allele frequencies.

# define groups to analyze ====
setPop(gen_o) <- ~OCEAN

gen_oce <- seppop(gen_o)

setPop(gen_o) <- ~REGION

gen_reg <- seppop(gen_o)

setPop(gen_o) <- ~POP

gen_pop <- seppop(gen_o)

gen_grp <- c(gen_oce, gen_reg, gen_pop)

gen_grp[["ALL"]] <- gen_o

# calculate diversity stats ====
loc_stats <- list()

for (p in names(gen_grp)) {
  
locA <- locus_table(gen_grp[[p]], index = "shannon") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

locB <- locus_table(gen_grp[[p]], index = "simpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

temp <- left_join(locA, locB)

locC <- locus_table(gen_grp[[p]], index = "invsimpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

loc_stats[[p]] <- left_join(temp, locC)
  
}

loc_stats <- ldply(loc_stats, data.frame) %>%
  select(-Hexp) %>%
  rename(GRP = `.id`,
         SIMPSON_IDX = `X1.D`,
         N_ALLELES = allele,
         SHANNON_IDX = H,
         STODD_TAYLOR_IDX = G,
         EVENNESS = Evenness)

# calculate genetic diversity stats (heterozygosity-based) ====
loc_stats_2 <- list()

for (p in names(gen_grp)) {
  
dat <- genind2hierfstat(gen_grp[[p]])
stats <- basic.stats(dat)

loc_stats_2[[p]] <- stats$perloc %>%
  rownames_to_column("LOCUS")

}

# combine into single data frame ====
loc_stats_2 <- ldply(loc_stats_2, data.frame) %>%
  rename(GRP = `.id`)

loc_stats <- left_join(loc_stats, loc_stats_2) %>%
  select(GRP, LOCUS, N_ALLELES, EVENNESS, Ho, Hs, Ht, Fis, SHANNON_IDX, SIMPSON_IDX, STODD_TAYLOR_IDX)

loc_stats[is.na(loc_stats)] <- NA

write_delim(loc_stats, "results/gendiv.o.locstats", delim = "\t")

9.2 Expected heterozygosity (Nei’s gene diversity).

Hs calculated as within population gene diversity (expected heterozygosity) according to (Nei 1987Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.).

Distribution of expected heterozygosity per estuary

Distribution of expected heterozygosity per estuary

Median, mean +/- std gene diversity by sample location.

GRP median mean std
FLA 0.4010 0.3486 0.2014
FLGS 0.2166 0.2519 0.1908
FLGN 0.2419 0.2554 0.1869
MB 0.1674 0.2128 0.1790
MISS 0.1165 0.1975 0.1918
CS 0.1630 0.2175 0.1874
LA 0.1441 0.2106 0.1875
CC 0.1995 0.2182 0.1867
CAMP 0.2821 0.2742 0.1973

Median, mean +/- std gene diversity by region.

REGION median mean std
SWATL 0.4010 0.3486 0.2014
EGULF 0.2397 0.2536 0.1879
CGULF 0.1482 0.2096 0.1852
WGULF 0.1995 0.2182 0.1867
SGULF 0.2821 0.2742 0.1973

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  Hs and GRP and LOCUS
## Friedman chi-squared = 34.905, df = 8, p-value = 0.00002782

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results pairwise comparisons using Wilcoxon signed rank test (p-value printed).

Results pairwise comparisons using Wilcoxon signed rank test (p-value printed).

9.3 Allelic richness

Calculate allelic richness corrected for sample size using rarefaction.

Compare levels rarefied allele counts among estuaries.

Distribution of allelic richness measured as rarefied allele counts per estuary

Distribution of allelic richness measured as rarefied allele counts per estuary

Median, mean +/- std rarefied allelic richness by sample location.

pop median mean std
FLA 2.0000 2.1606 0.7147
FLGS 1.9986 2.0739 0.6659
FLGN 1.9996 2.1229 0.6764
MB 2.0000 2.0027 0.6603
MISS 1.9616 1.9378 0.7307
CS 2.0000 2.0529 0.7399
LA 1.9860 1.9651 0.6567
CC 1.9993 1.9921 0.6841
CAMP 2.0000 2.1390 0.7786

Median, mean +/- std rarefied alllelic richness by region.

REGION median mean std
SWATL 2.0000 2.1606 0.7147
EGULF 1.9992 2.0984 0.6682
CGULF 1.9995 1.9896 0.6940
WGULF 1.9993 1.9921 0.6841
SGULF 2.0000 2.1390 0.7786

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  AR and pop and LOCUS
## Friedman chi-squared = 35.557, df = 8, p-value = 0.00002117

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results of pairwise comparison of allelic richness among estuaries using Wilcoxon signed rank test (p-value).

Results of pairwise comparison of allelic richness among estuaries using Wilcoxon signed rank test (p-value).

9.4 Evenness

Evenness is calculated as the ratio of the number of abundant genotypes to the number of rarer genotypes calculated using the ratio of Stoddart & Tayolor index (diversity index weighted for more abundant alleles) & Shannon-Wiener index (diversity index weighted for more rare alleles).

Distribution of evenness of allelic diversity per estuary

Distribution of evenness of allelic diversity per estuary

Median, mean +/- std evenness of allele frequencies by sample location.

GRP median mean std
FLA 0.8036 0.7607 0.1837
FLGS 0.6088 0.6464 0.1803
FLGN 0.6334 0.6334 0.1714
MB 0.5393 0.5923 0.1873
MISS 0.5174 0.5997 0.2006
CS 0.5802 0.6293 0.1993
LA 0.5428 0.6163 0.1989
CC 0.5789 0.6188 0.1918
CAMP 0.6764 0.6764 0.1644

Median, mean +/- std evenness of allele frequencies by region.

REGION median mean std
SWATL 0.8036 0.7607 0.1837
EGULF 0.6311 0.6398 0.1750
CGULF 0.5431 0.6092 0.1952
WGULF 0.5789 0.6188 0.1918
SGULF 0.6764 0.6764 0.1644

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  EVENNESS and GRP and LOCUS
## Friedman chi-squared = 24.907, df = 8, p-value = 0.001612

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results of pairwise comparisons of evenness of allelic diversity between estuaries (p-values printed).

Results of pairwise comparisons of evenness of allelic diversity between estuaries (p-values printed).

10 Comparison of genetic diversity neutral loci

10.1 Calculate measures of diversity

Calculate measures of genetic diversity based on allele frequencies.

# define groups to analyze ====
setPop(gen_nn) <- ~OCEAN

gen_oce <- seppop(gen_nn)

setPop(gen_nn) <- ~REGION

gen_reg <- seppop(gen_nn)

setPop(gen_nn) <- ~POP

gen_pop <- seppop(gen_nn)

gen_grp <- c(gen_oce, gen_reg, gen_pop)

gen_grp[["ALL"]] <- gen_nn

# calculate diversity stats ====
loc_stats <- list()

for (p in names(gen_grp)) {
  
locA <- locus_table(gen_grp[[p]], index = "shannon") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

locB <- locus_table(gen_grp[[p]], index = "simpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

temp <- left_join(locA, locB)

locC <- locus_table(gen_grp[[p]], index = "invsimpson") %>%
  as.data.frame() %>%
  rownames_to_column("LOCUS")

loc_stats[[p]] <- left_join(temp, locC)
  
}

loc_stats <- ldply(loc_stats, data.frame) %>%
  select(-Hexp) %>%
  rename(GRP = `.id`,
         SIMPSON_IDX = `X1.D`,
         N_ALLELES = allele,
         SHANNON_IDX = H,
         STODD_TAYLOR_IDX = G,
         EVENNESS = Evenness)

# calculate genetic diversity stats (heterozygosity-based) ====
loc_stats_2 <- list()

for (p in names(gen_grp)) {
  
dat <- genind2hierfstat(gen_grp[[p]])
stats <- basic.stats(dat)

loc_stats_2[[p]] <- stats$perloc %>%
  rownames_to_column("LOCUS")

}

# combine into single data frame ====
loc_stats_2 <- ldply(loc_stats_2, data.frame) %>%
  rename(GRP = `.id`)

loc_stats <- left_join(loc_stats, loc_stats_2) %>%
  select(GRP, LOCUS, N_ALLELES, EVENNESS, Ho, Hs, Ht, Fis, SHANNON_IDX, SIMPSON_IDX, STODD_TAYLOR_IDX)

loc_stats[is.na(loc_stats)] <- NA

write_delim(loc_stats, "results/gendiv.n.locstats", delim = "\t")

10.2 Expected heterozygosity (Nei’s gene diversity).

Hs calculated as within population gene diversity (expected heterozygosity) according to (Nei 1987Nei, Masatoshi. 1987. Molecular Evolutionary Genetics. https://doi.org/10.7312/nei-92038.).

Distribution of expected heterozygosity per estuary

Distribution of expected heterozygosity per estuary

Median, mean +/- std gene diversity by sample location.

GRP median mean std
FLA 0.2404 0.2686 0.2176
FLGS 0.2508 0.2811 0.2105
FLGN 0.2530 0.2823 0.2108
MB 0.2545 0.2835 0.2075
MISS 0.2504 0.2823 0.2097
CS 0.2605 0.2830 0.2165
LA 0.2522 0.2824 0.2089
CC 0.2518 0.2822 0.2071
CAMP 0.2485 0.2836 0.2093

Median, mean +/- std gene diversity by region.

REGION median mean std
SWATL 0.2404 0.2686 0.2176
EGULF 0.2524 0.2817 0.2106
CGULF 0.2538 0.2828 0.2107
WGULF 0.2518 0.2822 0.2071
SGULF 0.2485 0.2836 0.2093

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  Hs and GRP and LOCUS
## Friedman chi-squared = 324.44, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results pairwise comparisons using Wilcoxon signed rank test (p-value printed).

Results pairwise comparisons using Wilcoxon signed rank test (p-value printed).

10.3 Allelic richness

Calculate allelic richness corrected for sample size using rarefaction.

Compare levels rarefied allele counts among estuaries.

Distribution of allelic richness measured as rarefied allele counts per estuary

Distribution of allelic richness measured as rarefied allele counts per estuary

Median, mean +/- std rarefied allelic richness by sample location.

pop median mean std
FLA 1.9999 2.2735 0.9896
FLGS 2.0000 2.4611 1.0615
FLGN 2.0000 2.4683 1.0620
MB 2.0000 2.4918 1.0633
MISS 2.0000 2.4890 1.0774
CS 2.0000 2.4688 1.1167
LA 2.0000 2.4871 1.0687
CC 2.0000 2.4960 1.0614
CAMP 2.0000 2.5859 1.1789

Median, mean +/- std rarefied alllelic richness by region.

REGION median mean std
SWATL 1.9999 2.2735 0.9896
EGULF 2.0000 2.4647 1.0617
CGULF 2.0000 2.4842 1.0817
WGULF 2.0000 2.4960 1.0614
SGULF 2.0000 2.5859 1.1789

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  AR and pop and LOCUS
## Friedman chi-squared = 1942.1, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results of pairwise comparison of allelic richness among estuaries using Wilcoxon signed rank test (p-value).

Results of pairwise comparison of allelic richness among estuaries using Wilcoxon signed rank test (p-value).

10.4 Evenness

Evenness is calculated as the ratio of the number of abundant genotypes to the number of rarer genotypes calculated using the ratio of Stoddart & Tayolor index (diversity index weighted for more abundant alleles) & Shannon-Wiener index (diversity index weighted for more rare alleles).

Distribution of evenness of allelic diversity per estuary

Distribution of evenness of allelic diversity per estuary

Median, mean +/- std evenness of allele frequencies by sample location.

GRP median mean std
FLA 0.6071 0.6304 0.1815
FLGS 0.5681 0.6022 0.1737
FLGN 0.5755 0.6076 0.1726
MB 0.5663 0.6003 0.1720
MISS 0.5727 0.6062 0.1708
CS 0.6119 0.6342 0.1684
LA 0.5662 0.6001 0.1723
CC 0.5650 0.5973 0.1709
CAMP 0.5650 0.6031 0.1629

Median, mean +/- std evenness of allele frequencies by region.

REGION median mean std
SWATL 0.6071 0.6304 0.1815
EGULF 0.5714 0.6049 0.1732
CGULF 0.5783 0.6097 0.1715
WGULF 0.5650 0.5973 0.1709
SGULF 0.5650 0.6031 0.1629

Test for significant differences among estuaries using Friedman’s test.

## 
##  Friedman rank sum test
## 
## data:  EVENNESS and GRP and LOCUS
## Friedman chi-squared = 222, df = 8, p-value < 0.00000000000000022

Test for significant pairwise differences between estuaries using Wilcoxon signed rank test.

Results of pairwise comparisons of evenness of allelic diversity between estuaries (p-values printed).

Results of pairwise comparisons of evenness of allelic diversity between estuaries (p-values printed).